Abstract


The  objective  of  this  study  is  to  evaluate  the  predictive  validity  of  the  Capability  Maturity 
Model®  (CMM®)  for  Software  (SW-CMM)  as  applied  to  software  maintenance. 

The  SW-CMM  is  intended  to  apply  to  both  software  development  and  maintenance.  A  basic 
premise  (hypothesis)  of  the  SW-CMM  is  that  improving  process  maturity  will  result  in  better 
project  performance  and  product  quality.  The  extent  to  which  that  hypothesis  is  supported 
empirically  is  called  a  test  of  its  predictive  validity.  No  previous  evaluation  exists  of  the 
predictive  validity  of  the  SW-CMM  in  a  maintenance  context. 

The  extent  to  which  schedule  estimates  differ  from  reality  is  one  important  measure  of 
project  performance.  But  is  higher  maturity  in  fact  correlated  with  a  reduction  in  schedule 
deviation?  Data  from  752  maintenance  projects  drawn  from  441  SW-CMM  assessments  are 
analyzed  using  a  zero  inflated  Poisson  (ZIP)  regression  model,  and  the  results  are  validated 
using  a  bootstrap  estimation  method.  Projects  from  higher  maturity  organizations  typically 
report  less  schedule  deviation  than  those  from  organizations  assessed  at  lower  maturity 
levels. 


Capability  Maturity  Model  and  CMM  are  registered  in  the  U.S.  Patent  and  Trademark  Office  by 
Carnegie  Mellon  University. 
1 Introduction


1.1  The  Importance  of  Software  Maintenance 

The  Capability  Maturity  Model®  (CMM®)  for  Software  (SW-CMM)  [Paulk  et  al.  93a-93c, 
Paulk  et  al.  96]  cites  the  definition  of  maintenance  from  IEEE  Std  610-1990  [IEEE  90]  as 
“the  process  of  modifying  a  software  system  or  component  after  delivery  to  correct  faults, 
improve  performance  or  other  attributes,  or  adapt  to  a  changed  environment.”  This  definition 
includes  at  least  three  types  of  software  maintenance: 

1.  corrective  maintenance:  To  correct  processing,  performance,  or  implementation  faults 
of  the  software. 

2.  adaptive  maintenance:  To  adapt  the  software  to  changes  in  environment  such  as  new 
hardware or the next release of an operating system. Adaptive maintenance does not lead
to  changes  in  the  system’s  functionality. 

3.  perfective  maintenance:  To  perfect  the  software  for  its  performance,  processing 
efficiency,  maintainability,  or  accommodation  of  new  or  changed  user  requirements. 

The  IEEE  has  estimated  the  annual  cost  of  software  maintenance  in  the  United  States  to 
exceed $70 billion [Edelstein 93, Lefner 94]. Schrank has estimated it to be more than $30
billion annually [Schrank et al. 95]. Others have estimated the magnitude of software
maintenance  costs  to  range  from  40  to  80  percent  of  overall  software  life-cycle  costs 
[Alkhatib  92,  Kemerer  95,  Schrank  et  al.  95].  A  widely  used  rule  of  thumb  for  the  distribution 
of  maintenance  activities  has  been  60  percent  for  enhancements,  20  percent  for  adaptation, 
and  20  percent  for  error  correction  [Lientz  &  Swanson  80,  Glass  &  Noiseux  81]. 

While  the  SW-CMM  is  intended  to  be  suited  for  both  development  and  maintenance 
processes,  difficulties  in  implementing  the  model  in  maintenance-only  organizations  have 
been reported [Drew 92]. Others have criticized the SW-CMM for not directly addressing
maintenance  [Kuilboer  &  Ashrafi  00].  One  survey  study  conducted  in  the  United  Kingdom 
failed  to  find  evidence  that  higher  maturity  companies  manage  maintenance  more  effectively 
than  lower  maturity  companies;  however,  the  survey  does  not  explicitly  state  how  it  defines 
maturity  [Hall  et  al.  01].  Swanson  and  Beath  claimed  that  software  maintenance  is 
fundamentally  different  from  development  of  new  systems  since  the  maintainer  must  interact 
with  an  existing  system  [Swanson  &  Beath  89]. 


Niessink and van Vliet investigated the difference between software maintenance and
software  development  from  a  service  point  of  view  [Niessink  &  van  Vliet  00].  They  argued 
that  software  maintenance  can  be  seen  as  providing  a  service,  while  software  development  is 
concerned  with  the  development  of  products.  Hence,  they  developed  a  separate  information 
technology  (IT)  service  Capability  Maturity  Model  meant  for  software  maintenance 
organizations  and  other  IT  service  providers.  Similarly,  Kajko-Mattsson  developed  a  problem 
management  maturity  model  for  corrective  maintenance  [Kajko-Mattsson  02]. 


1.2  This  Study 

A  basic  premise  of  the  SW-CMM  is  that  higher  process  maturity  is  associated  with  better 
project  performance  and  product  quality.  Furthermore,  improving  maturity  is  expected  to 
subsequently  improve  both  performance  and  quality.  Testing  this  premise  can  be  considered 
an  evaluation  of  the  predictive  validity  of  the  assessment  measurement  procedure  [El-Emam 
&  Goldenson  95].  Given  both  the  high  cost  of  software  maintenance  and  enduring  questions 
about  the  applicability  of  the  SW-CMM,  it  is  important  to  provide  objective  evidence  about 
the  predictive  validity  of  the  SW-CMM  in  a  maintenance  context. 

This  study  provides  evidence  that  higher  process  maturity  is  in  fact  associated  with  “reduced 
mean and variance” of schedule deviation in software maintenance.1 The analysis is based on
752  maintenance  projects  from  441  CMM-Based  Appraisals  for  Internal  Process 
Improvement  (CBA IPI)  assessments.  A  zero  inflated  Poisson  (ZIP)  regression  model  is  used 
to  account  for  nonnegative  integer  values  and  the  existence  of  multiple  reports  of  no 
deviations  in  schedule.  The  results  are  validated  using  a  bootstrap  estimation  method. 

Section  2  reviews  previous  studies  on  predictive  validity  and  presents  the  study’s  hypotheses. 
Section  3  addresses  data  collection  and  the  characteristics  of  our  sample.  Section  4  presents  a 
brief  introduction  of  a  ZIP  regression  model  and  a  bootstrap  method  for  examining  the 
stability  of  our  results.  Section  5  presents  the  results  of  the  analysis.  Section  6  contains  our 
conclusions  and  final  remarks. 


1  While the results are similar across the software development life cycle, important distinctions will
be addressed in a subsequent study.
2 Empirical Hypotheses of Predictive Validity


2.1  Theoretical  Basis 

The  SW-CMM  provides  a  framework  for  organizing  software  processes  into  five 
evolutionary  steps,  or  maturity  levels,  which  lay  successive  foundations  for  continuous 
process  improvement  (Table  1).  The  SW-CMM  covers  practices  for  planning,  engineering, 
and  managing  software  development  and  maintenance.  More  mature  software  organizations, 
when  following  these  key  practices,  are  expected  to  be  better  able  to  meet  their  cost, 
schedule,  functionality,  product  quality,  and  other  performance  objectives  [Paulk  et  al.  96]. 


Table 1: Maturity Levels and their Key Process Areas [Paulk 99]

Level                Focus                            Key Process Areas

Level 5 Optimizing   Continuous process improvement   - Defect Prevention
                                                      - Technology Change Management
                                                      - Process Change Management

Level 4 Managed      Product and process quality      - Quantitative Process Management
                                                      - Software Quality Management

Level 3 Defined      Engineering processes and        - Organization Process Focus
                     organizational support           - Organization Process Definition
                                                      - Training Program
                                                      - Integrated Software Management
                                                      - Software Product Engineering
                                                      - Intergroup Coordination
                                                      - Peer Review

Level 2 Repeatable   Project management processes     - Requirements Management
                                                      - Software Project Planning
                                                      - Software Project Tracking and Oversight
                                                      - Software Subcontract Management
                                                      - Software Quality Assurance
                                                      - Software Configuration Management

Level 1 Initial      Competent people (and heroics)

Testing  the  above  basic  premise  of  the  SW-CMM  requires  an  empirical  evaluation  of  the 
predictive  validity  of  the  process  maturity  concept.  Is  there  a  characteristic  relationship 
between  process  maturity  and  independently  measured  performance  criteria?  Clearly,  such 
relationships  may  depend  on  other  contextual  factors;  that  is,  the  relationships  may  differ 
from  one  context  to  another  or  may  exist  in  only  a  few  contexts.  This  theoretical  basis  for 
evaluating predictive validity is depicted in Figure 1.

[Diagram elements: Process maturity; Performance (objectives); Contextual factors]

Figure 1: Theoretical Basis in a Predictive Validity Study

2.2  Variable  Definition  and  Empirical  Hypotheses 

In  the  context  of  software  maintenance  in  Figure  1,  schedule  deviation  is  the  performance 
measure we use as our dependent variable. Schedule deviation is defined as the absolute
value of the difference between actual schedule and planned schedule, i.e.,
$y = |\mathrm{Actual} - \mathrm{Planned}|$. Schedule deviation $y$ is expressed in months ahead or behind
schedule, with a value of zero indicating that the project is on schedule.2
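
The definition above can be sketched in a few lines of Python. The function name and the project durations are hypothetical and serve only to illustrate that projects ahead of and behind schedule by the same amount receive the same value of y.

```python
def schedule_deviation(actual_months: int, planned_months: int) -> int:
    """Return y = |Actual - Planned| in whole months (0 means on schedule)."""
    return abs(actual_months - planned_months)


# Hypothetical projects: (actual duration, planned duration) in months.
projects = {"late": (14, 12), "early": (10, 12), "on_time": (12, 12)}

deviations = {
    name: schedule_deviation(actual, planned)
    for name, (actual, planned) in projects.items()
}
print(deviations)
```

Because the absolute value discards the sign, the measure cannot distinguish early from late delivery.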

Our  explanatory  variable,  process  maturity,  is  coded  from  maturity  level  5  down  to  maturity 
level  1.  Maturity  level  is  an  ordinal  scale,  not  an  interval  scale;  however,  we  do  employ 
parametric  statistics  in  this  analysis. 

Previous  studies  show  that  the  distribution  of  maturity  levels  differs  between  the  United 
States  and  elsewhere  in  the  world  [SEI 02,  Jung  et  al.  02].  Hence,  we  examine  how  the  region 
where  the  assessment  was  conducted  (U.S.  versus  non-U.S.)  acts  as  a  contextual  factor  in 
mediating  the  effects  of  our  research  hypotheses.3 

The  theoretical  basis  shown  in  Figure  1  implies  that  schedule  deviation  is  negatively 
associated  with  maturity  level:  the  higher  the  maturity  level,  the  less  schedule  deviation.  In 
addition,  the  association  may  differ  across  regions  of  the  world.  Two  types  of  benefits  are 
expected  to  follow: 


2  One might argue that being ahead of schedule is less serious than being behind schedule; however,
too few projects reported being ahead of schedule to allow a separate analysis here. Other
weaknesses of the schedule deviation measure are described in Section 3.2.

3  Classical measurement theory posits that variables should be measured on at least an interval scale
to permit the computation of the mean and related parametric statistics [Stevens 51, Nunnally &
Bernstein 94], but using only nonparametric methods on non-interval scale data would exclude
much useful study [Nunnally & Bernstein 94]. Hence, many authors argue that a useful study can
be conducted even if the proscriptions are violated [Briand et al. 96, Gardner 75, Stevens 51,
Velleman & Wilkinson 93]. El-Emam and Birk provide a detailed discussion of the scale type issue
in studies of process capability and maturity [El-Emam & Birk 00a-00b].
•  HYPOTHESIS 1: Increasing maturity level reduces the mean of schedule deviation in
maintenance projects.

•  HYPOTHESIS 2: Increasing maturity level reduces the variance of schedule deviation in
maintenance projects.

Testing  these  two  hypotheses  in  software  maintenance  projects  allows  us  to  evaluate  the 
predictive  validity  of  the  process  maturity  concept.  The  same  two  hypotheses  have  been 
depicted  elsewhere  in  graphical  form  as  seen  in  Figure  2  [Paulk  et  al.  93a-93c,  Paulk  et  al. 
96]. 


[Figure 2 annotations, from highest to lowest maturity level:
Level 5: Performance continuously improves in Level 5 organizations.
Level 4: Based on quantitative understanding of process and product, performance continues to improve in Level 4 organizations.
Level 3: With well-defined processes, performance improves in Level 3 organizations.
Level 2: Plans based on past performance are more realistic in Level 2 organizations.
Level 1: Schedule and cost targets are typically overrun by Level 1 organizations.]

Figure 2: Process Capability as Indicated by Maturity Level [Paulk et al. 93a]


2.3  Previous  Empirical  Studies 

All  previous  studies  of  predictive  validity  in  process  improvement  are  based  either  implicitly 
or  explicitly  on  the  theoretical  model  depicted  in  Figure  1.  While  some  empirical  studies 
examine  variation  across  large  numbers  of  organizations,  most  of  them  are  case  studies  that 
describe  the  experiences  and  benefits  from  increasing  process  maturity  in  a  single 
organization  or  a  small  number  of  organizations. 

Case  studies  are  quite  useful  for  demonstrating  proof  of  concept.  There  clearly  are 
organizations  that  have  benefited  from  increased  process  maturity  [Brodman  &  Johnson  02, 
Butler 95, Diaz & Sligo 97, Diaz & King 02, Dion 92 & 93, Krasner 99].
Case studies, however, have a serious methodological disadvantage. It is difficult at best to
generalize  their  results  to  a  wider  population.  A  case  study  can  monitor  projects  in  depth,  but 
it  is  difficult  to  replicate  the  results  later  in  a  comparable  context.  Case  studies  also  tend  to 
suffer  from  a  selection  bias  [adapted  from  El-Emam  &  Birk  00b]: 

•  Organizations  that  have  not  shown  any  process  improvement  or  have  even  regressed  will 
be  highly  unlikely  to  publicize  their  results,  so  case  studies  tend  to  show  mainly  success 
stories. 

•  The  majority  of  organizations  do  not  collect  objective  process  and  product  data  (e.g.,  on 
defect  levels,  or  even  keep  accurate  effort  records).  Only  organizations  that  have  made 
improvements  and  reached  a  reasonable  level  of  maturity  will  have  the  actual  objective 
data  to  demonstrate  improvements  (in  productivity,  quality,  or  return  on  investment). 
Therefore,  failures  and  non-movers  are  less  likely  to  be  considered  as  viable  case  studies 
due  to  the  lack  of  data. 

By  now,  several  predictive  validity  studies  have  collected  data  from  larger  numbers  of 
organizations  or  projects,  and  they  have  statistically  investigated  relationships  between 
capability  maturity4  and  independent  measures  of  performance.  A  survey  study  of  individuals 
from  SW-CMM-assessed  organizations  shows  that  higher  maturity  organizations  tend  to 
perform  better  on  the  subjective  measures  of  performance  (including  ability  to  meet 
schedule),  product  quality,  staff  productivity,  customer  satisfaction,  and  staff  morale 
[Goldenson  &  Herbsleb  et  al.  94,  Herbsleb  et  al.  97].  In  another  survey-based  study, 
Deephouse  and  colleagues  found  evidence  of  predictive  validity  in  the  relationships  among 
seven  software  processes  and  measures  of  project  performance  including  meeting  schedule 
and  budget  targets,  quality,  and  rework  [Deephouse  et  al.  95,  Deephouse  et  al.  96]. 

Lawlis  and  colleagues  investigated  the  benefits  of  the  SW-CMM  with  two  measures  extracted 
from  U.S.  Air  Force  contracts  [Lawlis  et  al.  96].  Their  results  show  that  higher  maturity 
projects  typically  perform  better  on  indices  of  both  cost  and  schedule  performance  than  do 
those  at  a  lower  maturity  level.  In  a  study  combining  questionnaire  data  with  existing  project 
metrics,  Krishnan  and  Kellner  found  that  SW-CMM-based  process  maturity  was  associated 
characteristically  with  a  reduction  in  delivered  defects  after  correcting  for  size  and  personnel 
capability  [Krishnan  &  Kellner  99]. 

El-Emam  and  Birk  evaluated  the  predictive  validity  of  the  ISO/IEC  15504  (Software  Process 
Assessment  [ISO/IEC  96])  capability  measure  for  four  software  processes:  “Develop 
Software  Requirements,”  “Develop  Software  Design,”  “Implement  Software  Design,”  and 
“Integrate  and  Test  Software”  [El-Emam  &  Birk  00a,  El-Emam  &  Birk  00b].  They  found  that 
the “Develop Software Design” process was associated with several project performance
measures.  Using  the  same  dataset,  Hwang  and  Jung  found  that  higher  project-management 
process capability is related to increased productivity and improved morale in large
organizations [Hwang & Jung 03]. However, much weaker relationships were found between
project-management process capability and any of the performance measures in small
organizations.

4  Studies based on the SW-CMM have examined maturity level differences in performance. Studies
based on ISO/IEC 15504 (Software Process Assessment [ISO/IEC 96]), which uses a
continuous representation, have examined differences in process capability.

3 Data


3.1  Data  Collection 

3.1.1  Data  Source 

Authorized  lead  assessors  are  required  to  provide  reports  to  the  Software  Engineering 
Institute (SEI℠) for their completed assessments. Assessment data on the reports are kept in
an  SEI  repository  called  the  Process  Appraisal  Information  System  (PAIS).  The  PAIS 
includes  information  for  each  assessment  on  the  company  and  appraised  entity,  key  process 
area  (KPA)  profiles,  organization  and  project  context,  functional  area  representatives  groups, 
findings,  and  related  data.5 

This  report  considers  only  CBA IPI  assessments.  Not  all  CBA IPI  assessments  include  KPA 
rating  profiles,  since  the  determination  of  a  maturity  level  or  KPA  ratings  is  optional  and  is 
provided  at  the  discretion  of  the  assessment  sponsor.  The  dataset  that  we  analyzed  for  this 
study  was  extracted  from  appraisal  reports  in  the  PAIS  for  the  period  of  January  1998  through 
December 2001.


3.1.2  Dataset  Analyzed 

A  statistical  rule  of  thumb  states  that  there  should  be  at  least  six  observations  (sometimes 
five)  to  have  confidence  in  analysis  results.  A  similar  criterion  was  used  in  an  earlier  analysis 
of  software  process  assessment  [Jung  et  al.  01].  Briand  and  colleagues  [Briand  et  al.  00]  and 
El-Emam  and  colleagues  [El-Emam  et  al.  01]  also  have  used  a  “greater-than-five- 
observations”  criterion  for  the  validation  of  software  product  metrics. 

We  follow  the  same  rule  of  thumb  here.  Fewer  than  five  maintenance  projects  at  maturity 
levels  4  and  5  reported  any  schedule  deviation  whatsoever.  Hence,  we  exclude  maturity 
levels  4  and  5  from  our  statistical  analysis.  Note,  however,  that  the  lower  incidence  of 
reported  schedule  deviation  at  maturity  levels  4  and  5  is  of  course  entirely  consistent  with  our 
empirical  hypotheses. 


℠  SEI is a service mark of Carnegie Mellon University.

5  Submitting an assessment report does not imply that the SEI certifies any assessment findings or
maturity  levels.  All  assessment  data  are  kept  confidential  and  are  available  only  to  SEI  personnel 
on  a  need-to-know  basis  for  research  and  development.  Information  in  the  PAIS  is  used  to  produce 
industry  profiles  or  as  aggregated  data  for  research  publications,  and  the  SEI  publishes  a  Process 
Maturity  Profile  twice  a  year  (http://www.sei.cmu.edu/sema/profile.html). 
Data exist for 752 maintenance projects from 441 organizations assessed at maturity levels 1
through  3  inclusive.  Figure  3  shows  the  number  of  organizations  and  maintenance  projects 
assessed  by  region.  Since  more  than  one  maintenance  project  exists  in  some  organizations, 
the  number  of  organizations  is  fewer  than  the  number  of  projects. 


[Chart categories: U.S.; Non-U.S.]

Figure 3: Organizations and Maintenance Projects in Regions

Table  2  shows  the  number  of  assessed  maintenance  projects  at  each  maturity  level.  Schedule 
delays  were  reported  by  a  total  of  47  projects,  while  8  projects  reported  being  ahead  of 
schedule. 


Table 2: Number of Maintenance Projects

            Maturity Level 1   Maturity Level 2   Maturity Level 3   Total

U.S.        112 (12)           222 (6)            144 (5)            478 (23)

Non-U.S.    42 (6)             155 (20)           77 (6)             274 (32)

The numbers in parentheses denote the number of projects that reported deviations in schedule.
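
As a quick check on Table 2, the raw share of projects reporting any schedule deviation can be computed for each cell. This is only a descriptive sketch using the counts in the table; it is not the ZIP analysis used in this study, and the variable and function names are illustrative.

```python
# Counts transcribed from Table 2: (projects assessed, projects reporting deviation).
table2 = {
    "U.S.": {1: (112, 12), 2: (222, 6), 3: (144, 5)},
    "Non-U.S.": {1: (42, 6), 2: (155, 20), 3: (77, 6)},
}


def deviation_rates(table):
    """Return {region: {level: share of projects reporting any deviation}}."""
    return {
        region: {
            level: deviated / total
            for level, (total, deviated) in levels.items()
        }
        for region, levels in table.items()
    }


rates = deviation_rates(table2)
for region, levels in rates.items():
    for level, rate in levels.items():
        print(f"{region:9s} maturity level {level}: {rate:.1%}")

# Totals implied by the table, for comparison with the counts quoted in the text.
total_projects = sum(t for levels in table2.values() for t, _ in levels.values())
total_deviating = sum(d for levels in table2.values() for _, d in levels.values())
print(total_projects, total_deviating)
```

The totals reproduce the 752 projects in the dataset and the 55 projects (47 behind schedule plus 8 ahead) that reported a deviation.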

3.1.3  Unit  of  Analysis 

The  units  of  analysis  in  this  study  are  projects  in  the  maintenance  phase  of  their  life  cycles, 
and  our  performance  measure  is  schedule  deviation  expressed  in  months.  Since  the 
organization  typically  is  the  unit  of  analysis  in  CBA IPI  assessments,  our  measure  of  maturity 
is  organization-wide.  If  several  maintenance  projects  are  assessed  in  a  single  organization,  all 
of  the  projects  have  the  same  level  of  maturity  but  have  their  own  individual  values  of 
schedule  deviation. 
3.2 Data Quality

Our  analysis  mostly  relies  on  two  variables.  The  independent  variable  (covariate)  is 
organizational  maturity  level  as  determined  by  CBA IPI  assessment  teams.  A  previous  study 
provides  ample  confidence  in  the  quality  of  that  measure.6  The  dependent  variable,  schedule 
deviation, is a self-reported nonnegative integer measured in months; a project may
be  ahead,  behind,  or  on  schedule.  Our  reliance  on  such  a  measure  raises  significant  accuracy 
issues. 

In  particular,  a  very  large  proportion,  approximately  95  percent,  of  the  projects  in  the 
maintenance  phase  of  their  life  cycles  reported  being  on  schedule.  That,  of  course,  is  contrary 
to  both  the  results  of  previous  studies  and  practical  experience  in  the  field. 

Several  reasons  may  account  for  this  divergence.  The  question  that  is  used  to  measure 
schedule  deviation  only  asks  whether  or  not  the  project  is  on  time,  but  the  criteria  for  being 
on  time  are  not  specified.  One  likely  conjecture  is  that  many  projects  periodically  modify 
their  baseline  schedule  estimates,  which  results  in  less-reported  delay.  Another  is  that 
assessments  often  include  exemplary  projects. 

Time  ahead  or  behind  schedule  is  measured  in  months,  so  there  also  is  most  probably 
rounding  error  in  the  projects’  replies.  If  a  maintenance  project  is  delayed  for  six  weeks, 
should  it  be  recorded  as  a  delay  of  one  month  or  two?  Similarly,  should  a  two-week  delay  be 
reported  as  a  one-month  delay  or  as  essentially  on  time?  Moreover,  the  measure  does  not 
account  for  variations  in  project  size  and  duration.  For  example,  a  two-month  delay  in  a  one- 
month  project  is  treated  the  same  as  a  two-month  delay  in  a  nine-month  project. 

That  said,  as  one  might  expect,  reported  schedule  deviation  is  in  fact  higher  for  projects  that 
are  in  other  phases  of  their  life  cycles  than  maintenance.  For  example,  more  than  25  percent 
of  the  projects  in  test  and  integration  do  report  being  a  month  or  more  behind  schedule. 

Self-reports  and  direct  observation  often  differ.  For  example,  one  study  shows  that  software 
engineers  over-report  the  amount  of  time  that  they  work  by  an  average  of  almost  three 
percent;  the  proportion  of  times  that  self-reports  and  observer  reports  agreed  on  what  the 
software  engineer  actually  was  doing  varied  substantially,  from  95  to  58  percent  [Perry  et  al. 
96].  Errors  in  self-reports  have  been  noted  in  various  other  studies,  including  voting  [Abelson 
et  al.  92],  receiving  of  health  care  [Loftus  et  al.  92],  and  doctor’s  visits  [McCallum  et  al.  95]. 

6  In it we performed an internal-consistency reliability study using the same 676 CBA IPI
assessments on which the present work is based [Jung & Goldenson 02]. The results identified
three  underlying  dimensions  of  the  capability  maturity  construct.  “Project  implementation” 
includes  the  key  process  areas  (KPAs)  at  maturity  level  2,  “organization  implementation”  covers 
the  KPAs  at  maturity  level  3,  and  the  KPAs  at  both  maturity  levels  4  and  5  are  subsumed  under 
“quantitative  process  implementation.”  Cronbach’s  alpha  coefficient  of  internal  consistency  for 
each  of  the  three  dimensions  exceeds  the  recommended  value  of  0.9,  which  indicates  a  sufficiently 
high  level  of  internal  consistency  for  use  in  practice. 
Every measure has its strengths and weaknesses. For example, some studies recommend
using  relative  measures.7  But  if  the  denominator  has  a  small  value,  the  measure  may  be 
exaggerated  and  take  on  an  unreasonably  large  value. 
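
The sensitivity of relative measures to a small denominator can be illustrated with the MRE and MER definitions given in footnote 7. The sketch below uses hypothetical project durations; the function names are ours.

```python
def mre(actual: float, planned: float) -> float:
    """Magnitude of relative error: |Actual - Planned| / Actual."""
    return abs(actual - planned) / actual


def mer(actual: float, planned: float) -> float:
    """Magnitude of error relative: |Actual - Planned| / Planned."""
    return abs(actual - planned) / planned


# Two hypothetical projects, each two months late (durations in months).
short_project = {"actual": 3, "planned": 1}
long_project = {"actual": 11, "planned": 9}

for name, p in (("short", short_project), ("long", long_project)):
    print(
        f"{name}: MRE = {mre(p['actual'], p['planned']):.2f}, "
        f"MER = {mer(p['actual'], p['planned']):.2f}"
    )
```

Both projects have the same absolute deviation of two months, but the short project's MER is an order of magnitude larger because its planned duration, the denominator, is small.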

Other  candidate  measures  of  schedule  deviation  include  arithmetic  means  and  standard 
deviations.  However,  they  too  are  subject  to  a  lack  of  robustness;  one  very  small  or  very  large 
value  causes  them  to  take  on  an  arbitrarily  large  value.  A  trimmed  method  such  as  a 
Winsorized  standard  deviation  or  median  absolute  deviation  might  be  used  [Lunneborg  00]; 
however,  such  methods  cannot  be  applied  to  a  dataset  characterized  by  excess  zeros  such  as 
ours. 
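
One way to see the difficulty with robust dispersion measures is that the median absolute deviation collapses to zero whenever more than half of the values are zero. The sketch below uses hypothetical data and a plain, unscaled definition of the median absolute deviation.

```python
from statistics import median


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median (unscaled)."""
    center = median(values)
    return median(abs(v - center) for v in values)


# Hypothetical sample: 18 of 20 projects on schedule, two reporting deviations.
deviations = [0] * 18 + [1, 6]

print(median_absolute_deviation(deviations))
```

With this sample both the median and the median absolute deviation are zero, so the measure carries no information about the two projects that did deviate.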

We have very little independent basis for judging the criterion validity of the schedule
deviation  question  per  se.  Moreover,  maturity  level  is  an  organizational  construct  while 
schedule  deviation  can  vary  by  project.  This  too  may  introduce  measurement  error  into  our 
analysis.  As  will  be  seen  later,  however,  our  results  are  robust  in  spite  of  these  limitations. 

The  relationships  with  maturity  level  provide  compelling  evidence  of  the  predictive  validity 
of  the  SW-CMM. 


3.3  Sampling  Characteristics  of  the  Dataset 

Statistical  analysis  and  its  interpretations  depend  on  the  criteria  by  which  a  sample  (subset)  is 
selected  from  a  population.  Classical  population  inference  requires  random  sampling.  Hence, 
we  examine  here  the  sampling  characteristics  of  our  dataset. 

The  simplest  form  of  sampling  is  a  random  sample.  A  simple  random  sample  is  defined  as  “a 
set  of  cases  selected  from  a  well-defined  population  of  cases  by  a  process  that  ensures  that 
every  sample  containing  the  same  number  of  cases  has  the  same  chance  of  being  the  one 
selected”  [Lunneborg  00].  In  the  context  of  SW-CMM  assessments,  this  definition  explicitly 
requires  two  things:  (1)  a  well-defined  population  of  assessment  cases  from  which  to  sample, 
and  (2)  a  well-defined  random  process  for  selecting  the  sample. 

The  assessments  reported  to  the  PAIS  database  do  not  satisfy  these  two  requirements.  The 
population  and  the  size  of  its  assessments  cannot  be  clearly  defined,  and  the  assessed 
organizations  are  not  selected  on  a  random  basis.  Rather,  the  assessments  in  PAIS  are  a  self- 
selected  sample  (i.e.,  assessed  organizations  that  have  voluntarily  participated  in  CBA IPI 
assessments to improve their software processes or were required to do so by contractors).
Our  analyses  here  clearly  must  be  based  on  nonrandom  sampling  methods. 


7  Conte and colleagues [Conte et al. 86] suggest using a magnitude of relative error (MRE) measure
of schedule deviation, or $y_i = |(\mathrm{Actual} - \mathrm{Planned})/\mathrm{Actual}|$. Stensrud and colleagues [Stensrud et al.
02] prefer a measure of the magnitude of error relative (MER), or $y_i = |(\mathrm{Actual} - \mathrm{Planned})/\mathrm{Planned}|$.
In our nonrandom design, the PAIS dataset itself is a population of assessment cases, where
the  population  is  called  a  local  population  or  a  set  of  available  cases  [Lunneborg  00]. 
Although  the  PAIS  database  retains  the  largest  number  of  assessment  cases  available 
anywhere,  the  dataset  is  not  a  random  sample,  and  our  results  cannot  be  generalized  to  all 
SW-CMM  assessments  conducted  around  the  world.  Hence,  interpretation  of  our  results 
should  rightly  be  limited  to  assessments  reported  to  PAIS  by  the  current  base  of  CMM  users. 

Still, it is sensible to draw descriptive conclusions about the local population. The
descriptions  are  not  inferences  to  a  wider  population;  rather,  they  are  descriptive  statistics 
which  can  neither  be  generalized  to  others  nor  have  causal  implications.  Typical  descriptions 
include  measures  of  central  tendency  (e.g.,  means  or  medians),  dispersion  (e.g.,  variance  or 
control  limits),  or  relationship  (e.g.,  correlation  coefficients  or  internal  consistency). 

Descriptions  based  on  a  nonrandom  sample  need  assurance  that  they  truly  characterize  the 
available  cases  and  that  they  are  stable  [Lunneborg  00,  Montgomery  et  al.  98].  An  available 
set  of  cases  such  as  our  assessment  dataset  cannot  be  assumed  to  have  the  same  degree  of 
homogeneity  as  a  random  sample.  A  fair  description  is  a  stable  one  that  is  relatively 
uninfluenced  by  the  presence  of  specific  cases.  Thus,  results  such  as  those  in  this  report 
should  be  tested  for  their  stability  (homogeneity). 
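
A stability check of this kind can be sketched with a simple nonparametric bootstrap: resample the available cases with replacement and observe how much a description, here the mean, varies across resamples. The data, the number of resamples, and the percentile interval below are illustrative assumptions, not the procedure or results of this study.

```python
import random
from statistics import mean


def bootstrap_means(values, n_resamples=2000, seed=0):
    """Return the mean of each bootstrap resample (same size, with replacement)."""
    rng = random.Random(seed)
    n = len(values)
    return [mean(rng.choices(values, k=n)) for _ in range(n_resamples)]


def percentile_interval(samples, lower=0.025, upper=0.975):
    """Return a simple percentile interval from the sorted bootstrap statistics."""
    ordered = sorted(samples)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


# Hypothetical schedule deviations in months: mostly zeros, a few delays.
deviations = [0] * 45 + [1, 1, 2, 3, 6]

boot = bootstrap_means(deviations)
low, high = percentile_interval(boot)
print(f"observed mean = {mean(deviations):.2f}, interval = ({low:.2f}, {high:.2f})")
```

A narrow interval suggests that the description is not driven by a handful of specific cases; a wide one signals that a few projects dominate the result.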
4 Data Analysis


4.1  Correlation  and  Regression  Models 

Correlational  studies  have  been  used  to  investigate  whether  an  association  exists  between 
increased  capability  maturity  and  performance,  and  under  what  conditions  [Goldenson  et  al. 
99].  All  of  the  previous  studies  (except  case  studies)  reviewed  in  Section  2  are  correlational 
studies.  In  correlational  studies,  process  maturity  or  capability  and  performance  data  from  a 
large  number  of  organizations  or  projects  are  collected  and  statistically  analyzed  to  find 
relationships  between  them.  Correlational  studies  typically  compute  Pearson  or  Spearman 
correlations  or  investigate  regression  coefficients. 

Schedule  deviation,  as  we  have  defined  it,  is  limited  to  nonnegative  integer  values,  which  can 
be  called  count  outcomes.  More  than  one  regression  model  exists  for  count  outcomes  [Long 
97,  King  88],  so  it  is  necessary  to  select  an  appropriate  one.  The  selection  should  consider  the 
strengths  and  weaknesses  of  each  model  in  a  specific  application  field,  as  well  as  perceptions 
in  the  research  community  about  what  are  appropriate  models  of  count  outcomes. 

Schedule  deviation  is  a  relatively  rare  occurrence  in  our  dataset.  Many  projects  reported 
being  less  than  one  month  behind  schedule,  and  there  are  many  zero  values.  Hence,  this  study 
uses  a  zero  inflated  Poisson  (ZIP)  regression  model. 
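
As a rough illustration of why a plain Poisson model is a poor fit here, the sketch below computes the variance-to-mean ratio of schedule deviation from the summary statistics reported later in Table 3. A Poisson variable has a ratio of 1; a much larger ratio is consistent with (though not proof of) an excess of zeros. The function name `dispersion_index` is illustrative and not part of the original analysis.

```python
# Mean and standard deviation of schedule deviation (months), copied from Table 3.
TABLE3 = {
    "U.S.": {"mean": 0.17, "std": 0.99},
    "Non-U.S.": {"mean": 0.38, "std": 1.44},
}


def dispersion_index(mean, std):
    """Variance-to-mean ratio; equals 1 for a Poisson-distributed count."""
    return std ** 2 / mean


# Ratio per region; values far above 1 indicate overdispersion.
ratios = {region: dispersion_index(s["mean"], s["std"]) for region, s in TABLE3.items()}

for region, ratio in ratios.items():
    print(f"{region}: variance/mean = {ratio:.2f}")
```

Both ratios come out well above 5, which is one reason to prefer a count model that allows for extra zeros.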


4.2  A  Zero  Inflated  Poisson  (ZIP)  Regression  Model 

ZIP regression has been used elsewhere for predicting count outcomes in software
engineering [Khoshgoftaar et al. 02]. The ZIP regression model accounts for an excess
number of zero values in the dependent variable, which meets our current needs with
schedule deviation. Commonly used Pearson or Spearman correlations are not sufficient
to examine such an association.

Our  ZIP  regression  model  assumes  that  the  software  maintenance  processes  in  an  assessed 
organization  are  in  either  a  “perfect”  or  an  “imperfect”  state.  In  the  perfect  state,  no  schedule 
deviation  will  occur,  whereas  in  the  imperfect  state,  there  may  or  may  not  be  schedule 
deviation.  Several  factors  affect  the  distribution  of  schedule  deviation  in  software 
maintenance  and  the  probability  of  there  being  an  imperfect  state.  Process  maturity  is 
assumed  to  be  a  single  factor  for  the  purposes  of  this  study. 
Let $\psi_i$ be the probability that the $i$th maintenance project is performed by a
maintenance process that is in a perfect state. Then $(1 - \psi_i)$ is the probability that
the process of the $i$th maintenance project is in an imperfect state. Maintenance projects
whose processes are in a perfect state are always assumed to be on schedule. Projects whose
processes are in an imperfect state may still be on schedule; under a Poisson distribution
with parameter $\mu_i$, this happens with probability $\exp(-\mu_i)$. For maintenance
processes that are in an imperfect state, the probability that schedule deviation is $y_i$
months, for $y_i$ of one month or more, is the product of the probability of being in an
imperfect state and the probability of a schedule deviation of $y_i$ in a Poisson
distribution with parameter $\mu_i$.

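Therefore, the probability density function of the ZIP regression model is as follows
[Lambert 92, Long 97]:

$$
\Pr(y_i \mid x_i) =
\begin{cases}
\psi_i + (1 - \psi_i)\exp(-\mu_i), & \text{for } y_i = 0, \\[4pt]
(1 - \psi_i)\,\dfrac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!}, & \text{for } y_i = 1, 2, \ldots
\end{cases}
\qquad (1)
$$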


The conditional mean and variance of the ZIP probability function (1) are
$\mu_i(1 - \psi_i)$ and $\mu_i(1 - \psi_i)(1 + \mu_i\psi_i)$, respectively. If $\psi_i$ is 0,
then the ZIP regression model (1) becomes a Poisson regression model. The term
“conditional” is used to denote that the mean and variance depend on covariates. The only
covariate in this study is maturity level.
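
The density (1) and its moments can be checked numerically. The following is a minimal sketch, using hypothetical parameter values rather than estimates from this study; the function names are illustrative. It confirms that the probabilities sum to one and that the mean and variance computed from the density agree with the closed-form expressions given above.

```python
import math


def zip_pmf(y, mu, psi):
    """Probability of a schedule deviation of y months under density (1)."""
    poisson = math.exp(-mu) * mu ** y / math.factorial(y)
    if y == 0:
        # A zero arises from the perfect state or from a Poisson zero.
        return psi + (1.0 - psi) * poisson
    return (1.0 - psi) * poisson


def zip_mean(mu, psi):
    """Closed-form conditional mean."""
    return mu * (1.0 - psi)


def zip_variance(mu, psi):
    """Closed-form conditional variance."""
    return mu * (1.0 - psi) * (1.0 + mu * psi)


# Hypothetical values chosen for illustration only.
mu, psi = 2.5, 0.8
support = range(60)  # the tail mass beyond 59 is negligible for mu = 2.5

total = sum(zip_pmf(y, mu, psi) for y in support)
mean = sum(y * zip_pmf(y, mu, psi) for y in support)
variance = sum((y - mean) ** 2 * zip_pmf(y, mu, psi) for y in support)

print(total, mean, variance)
```

Setting `psi` to 0 in `zip_pmf` gives the ordinary Poisson probabilities, matching the remark that the ZIP model then reduces to Poisson regression.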


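The ZIP regression model is obtained by the following two link functions:

$$
\log(\mu_i) = \beta_0 + \beta_1 \times \text{MATURITY\_LEVEL}
$$

$$
\operatorname{logit}(\psi_i) = \log\!\left(\frac{\psi_i}{1 - \psi_i}\right)
  = \gamma_0 + \gamma_1 \times \text{MATURITY\_LEVEL}
$$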

A negative value of $\beta_1$ implies that a high-maturity maintenance process has less
schedule deviation than a low-maturity one. The probability that the maintenance process of
project $i$ is in a perfect state is estimated by

$$
\hat{\psi}_i = \frac{\exp(\hat{\gamma}_0 + \hat{\gamma}_1 \times \text{MATURITY\_LEVEL})}
                    {1 + \exp(\hat{\gamma}_0 + \hat{\gamma}_1 \times \text{MATURITY\_LEVEL})}
$$

and the Poisson parameter is estimated by

$$
\hat{\mu}_i = \exp(\hat{\beta}_0 + \hat{\beta}_1 \times \text{MATURITY\_LEVEL})
$$

4.3  Stability  Examination

Since our dataset is not a simple random sample, we also need to examine the stability of
the analysis results. For this purpose, we use a bootstrap resampling technique (see
footnote 8) that samples B times from the original observations with replacement, where B is
a large number such as 1,000. For each sample, the ZIP regression gives the descriptions of
interest: $\hat{\beta}_1$ (the coefficient of maturity level in the ZIP regression model),
$\mu_i(1 - \psi_i)$ (mean), and $\mu_i(1 - \psi_i)(1 + \mu_i\psi_i)$ (variance). Then, the
lower and upper limits of the confidence interval of each description are determined at the
2.5 and 97.5 percentiles, respectively, of the empirical reference distribution (i.e., a
histogram of B replications). The confidence interval of the empirical reference
distribution is called the empirical confidence interval (ECI). The bootstrap method is free
from unrealistic assumptions such as normality and homogeneity and is suitable for
conducting local inferences.
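
The percentile construction of the ECI can be sketched as follows. This is a simplified illustration under stated assumptions: the data are hypothetical, the description being resampled is a plain arithmetic mean rather than a ZIP regression estimate, and the helper names are not from the original study.

```python
import random
import statistics


def bootstrap_replicates(data, statistic, n_boot=1000, seed=0):
    """Recompute `statistic` on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    ]


def percentile_interval(replicates, lower=2.5, upper=97.5):
    """Empirical confidence interval read off the sorted replicates."""
    ordered = sorted(replicates)

    def pick(percent):
        return ordered[round(percent / 100.0 * (len(ordered) - 1))]

    return pick(lower), pick(upper)


# Hypothetical schedule deviations in months; most projects are on schedule.
sample = [0] * 85 + [1] * 6 + [2] * 4 + [3] * 3 + [6] * 2

replicates = bootstrap_replicates(sample, statistics.fmean)
eci = percentile_interval(replicates)
print(f"observed mean = {statistics.fmean(sample):.3f}, 95% ECI = {eci}")
```

Because the interval is taken directly from the histogram of replicates, it need not be symmetric around the observed value, which suits skewed count data.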

As  noted  earlier,  we  use  the  region  of  assessed  organizations  as  a  mediating  contextual  factor. 
The  proportions  of  assessments  in  the  two  regions  are  not  fixed  in  advance;  rather,  a 
bootstrap  sample  is  drawn  with  permutation  from  the  original  dataset  and  then  is  divided  into 
the  U.S.  cases  or  non-U.S.  cases  before  computing  our  descriptions.  Each  bootstrap  sample  is 
likely  to  have  different  proportions  of  U.S.  and  non-U.S.  cases.  This  is  called  “not  by  design” 
from  the  original  dataset  [Lunneborg  00]. 
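
The “not by design” procedure can be sketched as below: the pooled dataset is resampled first and split by region afterwards, so the regional group sizes vary from one bootstrap sample to the next. The sketch assumes resampling with replacement, as described at the start of this section, and uses only the project counts from Table 3; the record layout and function name are hypothetical.

```python
import random


def resample_not_by_design(cases, rng):
    """Resample the pooled cases, then split the resample by region."""
    n = len(cases)
    resample = [cases[rng.randrange(n)] for _ in range(n)]
    us = [c for c in resample if c["region"] == "US"]
    non_us = [c for c in resample if c["region"] != "US"]
    return us, non_us


# Placeholder records carrying only the region label (478 U.S., 274 non-U.S.).
cases = [{"region": "US"}] * 478 + [{"region": "non-US"}] * 274

rng = random.Random(1)
sizes = []
for _ in range(5):
    us, non_us = resample_not_by_design(cases, rng)
    sizes.append((len(us), len(non_us)))

print(sizes)  # group sizes differ across resamples but always total 752
```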

The description from the original dataset should be solidly in the middle of the empirical
reference distribution to be considered stable. It should not be at or near the limits of
the distribution. A measure for evaluating stability, the bias, is defined as follows:

$$
\text{Bias} = \frac{\sum_{b=1}^{B} t_b^{*}}{B} - \hat{\theta},
$$

where $t_b^{*}$ is the value of the description in the $b$th bootstrap sample,
$b = 1, \ldots, B$, and $\hat{\theta}$ is the description value from the original dataset.

The degree of bias is evaluated against the standard error (SE) of the description
distribution of B replicates. The SE is computed as follows:

$$
SE = \sqrt{\frac{\sum_{b=1}^{B} \left(t_b^{*} - \bar{t}^{*}\right)^2}{B - 1}},
\qquad \text{where } \bar{t}^{*} = \frac{\sum_{b=1}^{B} t_b^{*}}{B}.
$$


If the bias is large relative to the SE, there is an instability problem. A criterion for
judgment is that if the absolute value of the bias is less than one-quarter of the size of
the SE, the bias can be ignored [Efron & Tibshirani 93]. Hence, a description from the
original dataset can be considered to be stable.

8  This bootstrap method should not be confused with the Bootstrap model for process
assessment [Kuvaja 99].
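
The bias, SE, and quarter-of-SE criterion described in this section can be written compactly. The sketch below assumes the usual bootstrap standard error with a divisor of B - 1; the function names are illustrative, and the small list of replicates is made up for the example.

```python
import math


def bootstrap_bias(replicates, observed):
    """Mean of the bootstrap replicates minus the original description."""
    return sum(replicates) / len(replicates) - observed


def bootstrap_se(replicates):
    """Standard deviation of the replicates, using a divisor of B - 1."""
    b = len(replicates)
    center = sum(replicates) / b
    return math.sqrt(sum((t - center) ** 2 for t in replicates) / (b - 1))


def is_stable(bias, se):
    """Quarter-of-SE rule: the bias is ignorable if |bias| < SE / 4."""
    return abs(bias) < 0.25 * se


# Made-up replicates and observed value, for illustration only.
replicates = [1.0, 2.0, 3.0, 4.0, 5.0]
observed = 2.9

bias = bootstrap_bias(replicates, observed)
se = bootstrap_se(replicates)
print(bias, se, is_stable(bias, se))
```

For instance, a bias of -0.035 against an SE of 0.299 passes this rule, since 0.035 is below 0.299 / 4.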

Bootstrap methods have been used previously in empirical software engineering. El-Emam
and Garro estimated the number of ISO/IEC 15504 assessments by utilizing a
capture-recapture method [El-Emam & Garro 00]. Jung and Hunter utilized a bootstrap method
in computing confidence levels for the capability levels for each ISO/IEC 15504 process
[Jung & Hunter 01]. Jung and Goldenson used a bootstrap resampling method to evaluate the
stability of internal consistency in the SW-CMM [Jung & Goldenson 02].

5  Results


5.1  Descriptive  Statistics 

This  study  is  based  on  752  maintenance  projects  from  441  SW-CMM  CBA IPI  assessments. 
Figure  4  shows  the  distribution  of  organizations  and  maintenance  projects  by  region.  A  single 
maintenance  project  was  reported  in  each  of  56  percent  of  the  assessed  organizations 
(171+76=247).  Approximately  26  percent  of  the  assessed  organizations  in  the  United  States 
included  one  maintenance  project,  while  about  34  percent  of  the  non-U.S.  organizations 
included  a  single  maintenance  project.  Two  organizations  assessed  six  maintenance  projects 
each.  The  mean,  median,  and  standard  deviation  of  the  number  of  maintenance  projects  in 
these  assessed  organizations  in  the  United  States  are  1.67,  1,  and  1.01,  respectively.  In  the 
non-U.S.  organizations,  the  mean,  median,  and  standard  deviation  of  maintenance  projects 
are  1.75,  1.5,  and  0.95,  respectively. 


Figure  5  shows  the  distribution  of  maturity  level  by  region.  If  two  or  more  maintenance 
projects  exist  in  an  assessed  organization,  the  maturity  level  is  counted  two  or  more  times. 
The most frequent maturity level is 2 (Repeatable) in both regions, followed by level 3
(Defined) and level 1, in that order. Means and standard deviations are presented in Table 3.
Maturity  levels  4  and  5  are  not  considered  in  this  study  because  of  the  very  small  number  of 
maintenance  projects  that  report  delayed  schedules  at  those  levels  of  process  maturity. 

Figure 5: Distribution of Maturity Level Among Assessed Organizations (panels: U.S. dataset
and non-U.S. dataset)


The proportion of organizations at maturity level 2 clearly is not larger than that at
maturity level 1 in software industries throughout the world [Fayad & Laitinen 97]. More
likely, as early adopters of a new technology and specifically as organizations interested
in software process improvement, the organizations in our sample are drawn from the “higher
end” of the maturity spectrum. This phenomenon has been detected in the ISO/IEC PDTR 15504
trials as well [Rout et al. 98].


As  shown  in  Table  3,  the  arithmetic  mean  maturity  level  in  the  U.S.  dataset  is  nearly  equal  to 
that  in  the  non-U.S.  dataset.  But,  the  arithmetic  mean  of  schedule  deviations  in  the  U.S. 
dataset,  0.17,  is  less  than  half  the  value  of  0.38  in  the  non-U.S.  dataset. 


Table 3: Descriptive Statistics of Maturity Level and Schedule Deviation

                     Maturity level         Schedule deviation
                     Mean     Std dev       Mean     Std dev
U.S. (478)           2.07     0.73          0.17     0.99
Non-U.S. (274)       2.13     0.65          0.38     1.44

Table 4 shows the arithmetic mean of schedule deviation at each maturity level. Although
arithmetic means are not robust to extreme values, mean schedule deviation decreases as
maturity level increases in both the U.S. and non-U.S. datasets.


Table 4: Arithmetic Mean of Schedule Deviation at Each Maturity Level

              Maturity level 1      Maturity level 2      Maturity level 3
              Mean     Std dev      Mean     Std dev      Mean     Std dev
U.S.          0.464    1.750        0.086    0.622        0.069    0.468
Non-U.S.      0.643    2.070        0.407    1.463        0.195    0.828

5.2  Analysis  Results


5.2.1  Parameter  Estimation  and  Stability  Test 

The results of our ZIP regression analyses are given in Table 5. As expected, the estimated
coefficient of maturity level, $\hat{\beta}_1$, is both negative and statistically
significant for both the U.S. and non-U.S. datasets. The negative association indicates that
schedule deviation decreases across the maintenance projects as their respective
organizations’ maturity levels progressively increase. This is consistent with the
hypothesis in our theoretical model.


Table 5: ZIP Regression Results of Schedule Deviation

                                  U.S.                       Non-U.S.
                                  Estimated    One-sided     Estimated    One-sided
                                               p-value                    p-value
Intercept ($\gamma_0$)             1.625       0.002          1.376       0.007
MATURITY_LEVEL ($\gamma_1$)        0.682       0.008          0.278       0.089
Intercept ($\beta_0$)              1.886       0              1.841       0
MATURITY_LEVEL ($\beta_1$)        -0.428       0.004         -0.364       0.009
Goodness-of-fit ($\chi^2$)        [illegible]  1              9.558       0.047

In addition, the log ratio of perfect to imperfect state, $\log[\psi_i / (1 - \psi_i)]$, has
a positive association ($\gamma_1 > 0$) with maturity level. The coefficient for the
non-U.S. dataset is significant only at the 8.9% level, which indicates a weak association;
however, the results for both regions indicate that the probability of being in a perfect
state increases as maturity progressively increases.
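
As a consistency check, the point estimates in Table 5 can be plugged into the two link functions of Section 4.2 to obtain the fitted mean and variance at each maturity level. The sketch below does this; the non-U.S. value of $\beta_1$ is taken as -0.364, the observed coefficient quoted later in this section, and the names `ESTIMATES` and `fitted` are illustrative. Under these assumptions the results agree, to three decimals, with the observed values listed in Tables 6 and 7.

```python
import math

# Point estimates from Table 5 (g = gamma, b = beta). The non-U.S. b1 is the
# observed coefficient quoted in the text of this section (assumption).
ESTIMATES = {
    "U.S.": {"g0": 1.625, "g1": 0.682, "b0": 1.886, "b1": -0.428},
    "Non-U.S.": {"g0": 1.376, "g1": 0.278, "b0": 1.841, "b1": -0.364},
}


def fitted(region, level):
    """Fitted ZIP quantities for a region at a given maturity level."""
    e = ESTIMATES[region]
    eta = e["g0"] + e["g1"] * level
    psi = math.exp(eta) / (1.0 + math.exp(eta))  # P(perfect state)
    mu = math.exp(e["b0"] + e["b1"] * level)     # Poisson parameter
    mean = mu * (1.0 - psi)
    variance = mean * (1.0 + mu * psi)
    return {"psi": psi, "mu": mu, "mean": mean, "variance": variance}


for region in ESTIMATES:
    for level in (1, 2, 3):
        f = fitted(region, level)
        print(f"{region} level {level}: psi={f['psi']:.3f} "
              f"mean={f['mean']:.3f} variance={f['variance']:.3f}")
```

The fitted probability of a perfect state rises with maturity level in both regions, while the fitted mean and variance fall.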

The chi-square goodness-of-fit values (see footnote 9) in the last row of Table 5 show the
aptness of our ZIP regression model [Cameron & Trivedi 98]. Each of the two fitted models
conforms to the assumptions of the ZIP regression model at an alpha value of 1 percent.
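
The goodness-of-fit statistic compares observed and estimated counts cell by cell. The sketch below is illustrative only: the counts are hypothetical, the degrees of freedom are passed in as an assumption because they depend on how cells are pooled, and the right-tail probability uses a closed form that is valid only for even degrees of freedom.

```python
import math


def chi_square_statistic(observed, expected):
    """Sum over cells of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_right_tail(x, df):
    """Right-tail probability of a chi-square variable (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch handles positive even df only")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


# Hypothetical observed and estimated project counts for deviations of
# 0, 1, 2, and 3 or more months; every expected count exceeds 5.
observed = [200, 30, 12, 8]
expected = [198.0, 33.0, 11.0, 8.0]

statistic = chi_square_statistic(observed, expected)
p_value = chi_square_right_tail(statistic, df=2)  # df chosen for illustration
print(f"chi-square = {statistic:.3f}, right-tail p = {p_value:.3f}")
```

A small statistic with a large right-tail probability, as in this made-up case, gives no reason to reject the fitted model.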

Figure  6  shows  a  graph  comparing  the  fitted  and  actual  probabilities  for  the  non-U.S.  case. 
The  better  the  fit  is,  the  smaller  the  difference  of  probabilities.  Figure  6  shows  that  the 
number  of  one-month  deviation  projects  is  slightly  underestimated.  On  the  other  hand,  the 
number  of  projects  with  three-  and  four-month  deviations  is  slightly  overestimated.  However, 
all  of  the  differences  are  negligible.  For  the  U.S.  dataset,  the  plot  is  omitted  because  the 
difference  of  fitted  and  actual  probabilities  is  quite  small. 


9  A null hypothesis for chi-square goodness-of-fit is that no difference exists between
actual counts and estimated counts, i.e., $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$, where
$O_i$ and $E_i$ are observed and estimated schedule deviation, respectively ($E_i > 5$).
Thus, a large statistic and a small p-value imply a poor model fit. The p-value is a
right-tail probability [Cameron & Trivedi 98].

Figure 6: A Plot of Actual and Estimated Probabilities


The estimated coefficients $\hat{\beta}_1$ are most directly related to our hypotheses, and
they are examined for their stability here. Figure 7 shows the bootstrap distributions of
the estimated coefficients $\hat{\beta}_1$ of schedule deviation with 1,000 replicates. For
the U.S. dataset (on the left in Figure 7), the dotted and solid vertical lines denote a
bootstrap coefficient of -0.463 and an observed coefficient (see footnote 10) of -0.428,
respectively. The difference between them, -0.035, is defined as the bias in bootstrap
sampling. It is ignorable in comparison with the SE value of 0.299, since its absolute value
is well below one-quarter of the SE. Therefore, we conclude that the estimated coefficient
$\hat{\beta}_1$ of maturity level is stable. In the bootstrap distribution, 97 percent of
the estimated coefficients have negative values.

For the non-U.S. dataset, the bias of the maturity level coefficient, -0.363 - (-0.364) =
0.001, is also ignorable in comparison with the SE value of 0.314. However, only 89 percent
of the bootstrap estimates of the maturity level coefficient $\hat{\beta}_1$ are negative.
The remaining 11 percent is relatively high compared to the p-value of 0.009 in Table 5.


10  The term “observed” refers to the sample in our dataset, i.e., the estimated value from
our original dataset.

Figure 7: Bootstrap Distribution of Estimated Coefficients $\hat{\beta}_1$


5.2.2  Mean  and  Its  Stability 

As seen in Table 5, the negative coefficients $\hat{\beta}_1$ of process maturity support
the hypothesis that increases in maturity level result in decreases in schedule deviation.
Figure 8 shows the evaluation of HYPOTHESIS 1 in fuller detail. The (expected) mean
$\mu_i(1 - \psi_i)$ of probability density function (1) decreases with maturity level, and
the decrease is distinct for both the U.S. and non-U.S. datasets.

Figure 8: Mean of Schedule Deviation at Maturity Levels 1-3 (horizontal axis: maturity level)

Table 6 shows results of the bootstrap resampling that examines the stability of the
expected mean at each maturity level. We conclude that the mean at each level is stable
because the absolute value of the bias is smaller than one-quarter of the SE.


Table 6: Bootstrap Results of Mean Schedule Deviation

Region      Maturity   Observed   Bootstrap   Bias      SE          95% ECI
            level      mean       mean
U.S.        1          0.389      0.390        0.001    [missing]   [0.134, 0.689]
            2          0.134      0.127       -0.007    [missing]   [missing]
            3          0.045      0.047        0.002    [missing]   [missing]
Non-U.S.    1          0.703      0.707        0.004    0.265       [0.270, 1.273]
            2          0.385      0.374       -0.011    [missing]   [missing]
            3          0.209      0.216        0.007    0.084       [0.078, 0.389]

The  observed  means  are  a  result  of  the  sample  in  our  dataset.  Different  samples  would 
produce  different  mean  values.  Hence,  a  confidence  interval  is  employed  to  delimit  the  true 
(unknown)  mean  value  of  schedule  deviation  at  each  maturity  level.  The  95%  ECI  in  Table  6 
is  computed  from  Figure  9,  which  is  a  bootstrap  empirical  reference  distribution  of  mean 
schedule  deviation  with  1,000  replicates. 

As an example of the ECI interpretation, we can say with a confidence of 95 percent that
mean schedule deviation at maturity level 1 in the U.S. dataset is somewhere in the interval
between 0.134 and 0.689. But this interpretation is limited to the current dataset; it
cannot be extended to all industries in the United States. Since the bootstrap empirical
reference distribution in Figure 9 does not satisfy a normality assumption, using a
bootstrap ECI is justified.

Note  that  there  are  long  tails  on  the  right-hand  side  of  the  non-U.S.  distributions.  They  are 
truncated  for  reasons  of  space.  In  Figure  9,  however,  the  same  basic  results  hold  for  both  the 
U.S.  and  non-U.S.  data. 

The 95% ECIs among the maturity levels in Table 6 partially overlap each other. The
empirical reference distributions in Figure 9 also show that overlap. Hence, we must test
whether there is a significant difference in the mean schedule deviation between maturity
levels. The empirical reference distributions in Figure 9 clearly indicate that we cannot
employ a parametric test to examine the mean differences. However, the bootstrap method
shows statistically significant differences in mean schedule deviation between maturity
levels 1 and 2 and between levels 2 and 3 in the U.S. dataset, with a p-value of 0.005 in
both cases. Corresponding p-values of 0.04 and 0.039 show that there also are significant
differences in mean schedule deviation for the same two pairs of maturity levels in the
non-U.S. dataset.
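
The exact procedure behind these bootstrap p-values is not spelled out here. One common approach, sketched below purely as an assumption, is to count the share of bootstrap replicates in which the mean at the higher maturity level is not smaller than the mean at the lower level. The data are hypothetical, each level is resampled separately for simplicity, and the function name is illustrative.

```python
import random
import statistics


def bootstrap_p_value(lower_level, higher_level, n_boot=1000, seed=0):
    """Share of replicates where the higher level does not have a smaller mean."""
    rng = random.Random(seed)

    def resampled_mean(values):
        return statistics.fmean(rng.choices(values, k=len(values)))

    failures = sum(
        resampled_mean(higher_level) >= resampled_mean(lower_level)
        for _ in range(n_boot)
    )
    return failures / n_boot


# Hypothetical schedule deviations in months at two adjacent maturity levels.
level_1 = [0] * 80 + [1] * 8 + [3] * 6 + [6] * 6
level_2 = [0] * 190 + [1] * 6 + [2] * 4

p = bootstrap_p_value(level_1, level_2)
print(f"one-sided bootstrap p-value = {p:.3f}")
```

Because the comparison relies only on the replicates themselves, it does not require the normality assumption that a parametric test would.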

Figure 9: Bootstrap Distribution for Mean Schedule Deviation

5.2.3  Variance  and  Its  Stability

Our second hypothesis requires us to evaluate the reduction of variance in schedule
deviation with respect to maturity level. Figure 10 shows how the conditional variance of
the ZIP probability density function (1), $\mu_i(1 - \psi_i)(1 + \mu_i\psi_i)$, is reduced
with respect to maturity level. Again, the reduction in variance is significant.

Figure 10: Variance of Schedule Deviation at Maturity Levels 1-3

The bootstrap resampling results in Table 7 show that the absolute value of the bias is less
than a quarter of the SE at every maturity level. The estimated variance in schedule
deviation therefore also is stable at each maturity level.


Table 7: Bootstrap Results of Variance of Schedule Deviation

Region      Maturity   Observed   Bootstrap   Bias      SE       95% ECI
            level      variance   mean
U.S.        1          1.910      1.961        0.051    0.947    [0.523, 4.100]
            2          0.492      0.464       -0.028    0.188    [0.159, 0.905]
            3          0.126      0.138        0.012    0.099    [0.018, 0.370]
Non-U.S.    1          3.289      3.490        0.201    1.909    [0.839, 7.958]
            2          1.409      1.370       -0.039    0.447    [0.647, 2.315]
            3          0.608      0.684        0.076    0.575    [0.133, 1.511]
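
The stability claim can be verified row by row from Table 7. The short sketch below, with illustrative names, checks that each reported bias equals the bootstrap mean minus the observed value and that its absolute value is below one-quarter of the SE.

```python
# Rows from Table 7: (region, level, observed, bootstrap mean, bias, SE).
TABLE7 = [
    ("U.S.", 1, 1.910, 1.961, 0.051, 0.947),
    ("U.S.", 2, 0.492, 0.464, -0.028, 0.188),
    ("U.S.", 3, 0.126, 0.138, 0.012, 0.099),
    ("Non-U.S.", 1, 3.289, 3.490, 0.201, 1.909),
    ("Non-U.S.", 2, 1.409, 1.370, -0.039, 0.447),
    ("Non-U.S.", 3, 0.608, 0.684, 0.076, 0.575),
]


def check_row(observed, boot_mean, bias, se):
    """Return (bias matches bootstrap mean - observed, |bias| < SE / 4)."""
    consistent = abs((boot_mean - observed) - bias) < 5e-4
    stable = abs(bias) < se / 4
    return consistent, stable


for region, level, observed, boot_mean, bias, se in TABLE7:
    consistent, stable = check_row(observed, boot_mean, bias, se)
    print(f"{region} level {level}: |bias|/SE = {abs(bias) / se:.3f}, "
          f"consistent={consistent}, stable={stable}")
```

The largest ratio of absolute bias to SE is about 0.15, so every row satisfies the criterion.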

Finally, the 95% ECIs of conditional variance in Table 7 also partially overlap. The
empirical reference distributions in Figure 11 lead to the same conclusion. Therefore, we
can use the bootstrap empirical reference distributions in Figure 11 to evaluate the variance
difference  in  the  schedule  deviation  between  maturity  levels.  In  the  U.S.  dataset,  95  percent 
of  the  1,000  replicates  show  that  the  variance  in  schedule  deviance  at  maturity  level  2  is  less 

6  Conclusion


This  study  presents  compelling  evidence  about  the  predictive  validity  of  the  SW-CMM  as 
applied  to  software  maintenance.  A  basic  premise  of  the  SW-CMM  is  that  higher  maturity 
should  result  in  better  project  performance.  We  find  that  assessed  maturity  level  is  in  fact 
related  as  expected  to  schedule  deviation  in  software  maintenance  projects,  and  our  results 
are  quite  robust,  in  spite  of  the  limitations  of  the  data.  While  important  distinctions  remain  to 
be  addressed,  the  results  are  similar  across  the  software  development  life  cycle;  they  do  not 
appear  to  be  limited  to  maintenance  projects. 

A  univariate  ZIP  regression  model  is  employed  to  test  the  premise.  Since  the  results  are  based 
on  non-random  sampling,  they  are  validated  using  a  bootstrap  estimation  method. 

The  results  show  that  maintenance  projects  in  higher  maturity  organizations  typically  have 
lower  mean  and  variance  in  schedule  deviation  than  do  comparable  projects  from 
organizations assessed at lower levels of maturity. The schedule estimates of projects from
higher maturity organizations are markedly more accurate and more predictable.

Clearly,  organizational  maturity  is  not  the  only  factor  that  affects  schedule  deviation  in 
software  maintenance  projects.  Neither  is  schedule  deviation  the  only  performance  measure 
worth  considering.  Other  measures  of  performance  such  as  cost,  productivity,  quality,  and 
customer  satisfaction  should  be  evaluated  in  future  analyses  of  the  predictive  validity  of 
Capability  Maturity  Modeling®.  Moreover,  such  analyses  should  be  extended  to  CMM 
Integration  and  the  full  life  cycle  of  the  development,  maintenance,  and  acquisition  of 
software-intensive  systems. 


Capability  Maturity  Modeling  is  registered  in  the  U.S.  Patent  and  Trademark  Office  by  Carnegie 
Mellon  University. 

SM  CMM Integration is a service mark of Carnegie Mellon University.